ABSTRACT

Although previous studies have shown that a large file of overlapping 
register windows can greatly reduce procedure call/return overhead, the 
effects of register windows in a multiprogramming environment are poorly 
understood. This paper investigates the performance of 
multiprogrammed, reduced instruction set computers (RISCs) as a 
function of window management strategy. Using an analytic model that 
reflects context switch and procedure call overheads, we analyze the 
performance of simple, linearly self-recursive programs. For more 
complex programs, we present the results of a simulation study. These 
studies show that a simple strategy that saves all windows prior to a 
context switch, but restores only a single window following a context 
switch, performs near optimally. 
1. Introduction


Although a return to simple instruction sets was first advocated by John Cocke and later 
successfully realized in the IBM 801 [Radi82], the Stanford MIPS [Henn84], and UC-Berkeley 
RISC-II [Patt82, Kate84], the source and magnitude of reduced instruction set computer (RISC) 
performance increases have been surrounded by controversy. One of the major contributing 
factors to the debate has been the presence of a large register file in the RISC-II design. The 
portion of RISC-II's performance attributable to its register file has been contested [Hitc85] and
has renewed discussions on register file design. Because this paper considers the management of 
RISC-II register files, we digress to briefly review their organization. 

RISC-II Register Design 

The UC-Berkeley RISC-II design [Patt82] provides each procedure invocation with a
"window" of 32 registers; see Figure 1. The window associated with a called procedure partially
overlaps both the window of the calling procedure (the "high" registers) and the window of the 
next procedure called (the "low" registers). Thus, the "high" registers contain the parameters 
passed from the caller, and the "low" registers are used to pass parameters to the next callee. On
a procedure call, the "low" registers of the current window become the "high" registers of the
callee's window. The "local" registers are, as the name implies, available for use by the
procedure. Finally, the "global" registers are shared by all windows.

Figure 1 RISC-II Register Organization

This overlapped register scheme reduces memory traffic in two important ways. First, 
rather than placing parameters on the stack prior to a procedure call, they can remain in 
registers. Second, by providing a sufficient number of overlapping register windows, the registers 
of the invoking procedure need not be saved prior to a procedure call. Of course, it is possible for 
the depth of the dynamic chain of procedure calls to exceed the number of register windows. In 
this case, some portion of the register file must be saved in memory to provide space for 
additional procedure invocations. Tamir [Tami83] has investigated strategies for solving this 
problem. 

Overview 

Although the RISC-II register file organization does reduce memory traffic due to procedure 
calls, its value is clouded by several pragmatic issues. First, the performance gains attributable 
to reduced procedure call overhead are lessened by the longer machine cycle time that results 
from capacitive loading of longer buses. Second, little is known about the behavior of multiple 
register windows in a multiprogramming environment, with its associated context switching. 
Certainly, the context switch overhead in a multiple register window architecture is greater than 
that in a single register set architecture, but it is not known if the performance gains due to 
reduced procedure call overhead are offset by larger context switch overheads. 

In this paper we evaluate the performance of a RISC-II processor with multiple register
windows in a multiprogramming environment. In section 2, three window management
strategies are discussed. Section 3 presents an analytic model of register management. Finally,
section 4 presents the results of a simulation study that confirms the results obtained from the
analytic models.

2. Register Management Strategies 

As mentioned above, Tamir [Tami83] investigated RISC-II register window management
strategies for execution of stand-alone programs. In RISC-II, register windows form a
last-in-first-out (LIFO) buffer. On a procedure call, the processor allocates an adjacent, overlapping
register window, provided one is available in the register file. Otherwise, a register file overflow 
occurs, and one or more windows are pushed to memory, freeing a window for the pending 
procedure call. On a procedure return, the processor switches to the previously active window. If 
this window is no longer in the register file, an underflow occurs, and one or more windows are 
restored from memory. Two pointers, a current window pointer (CWP) and a saved window
pointer (SWP), are used to manage windows in the LIFO buffer and to recognize window overflows
and underflows [Kate84]. Tamir [Tami83] showed that the simplest management strategy,
namely saving the oldest window on overflow and restoring one window on underflow, was nearly
optimal. Therefore, in the remainder of this paper, we assume this strategy is used.
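
The overflow and underflow behavior described above can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions, not the RISC-II hardware mechanism: it tracks only how many windows of the current call chain are resident, ignores the CWP/SWP pointer arithmetic, and the names `WindowFile` and `count_traps` are illustrative. It follows the policy assumed in this paper: save the oldest window on overflow and restore one window on underflow.

```python
class WindowFile:
    """Counts window overflows/underflows for one process (illustrative model)."""

    def __init__(self, num_windows):
        self.num_windows = num_windows  # W, windows in the register file
        self.resident = 1               # windows of the call chain held in registers
        self.in_memory = 0              # windows of the call chain saved in memory
        self.overflows = 0
        self.underflows = 0

    def call(self):
        if self.resident == self.num_windows:
            # Register file full: push the oldest window to memory.
            self.resident -= 1
            self.in_memory += 1
            self.overflows += 1
        self.resident += 1

    def ret(self):
        self.resident -= 1
        if self.resident == 0 and self.in_memory > 0:
            # The caller's window is not in the register file: restore one window.
            self.in_memory -= 1
            self.resident = 1
            self.underflows += 1


def count_traps(events, num_windows):
    """events is a string of 'C' (call) and 'R' (return) characters."""
    wf = WindowFile(num_windows)
    for e in events:
        if e == "C":
            wf.call()
        else:
            wf.ret()
    return wf.overflows, wf.underflows
```

For example, three nested calls followed by three returns cause no traps with eight windows, but two overflows and two underflows with only two windows.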

Context Switching 

When a processor is multiprogrammed, the process associated with each program is 
suspended and resumed many times before completion. The operating system must preserve the 
state of the process at the end of each time slice. If the processor contains a single register set, 
this preservation typically entails copying the contents of all registers to memory. If the 
processor has many register windows, the context switch overhead includes, in principle, saving 
all active register windows. For a machine like RISC-II, this cost can be large. Fortunately, 
there are several alternative register management schemes, and some avoid saving all registers. 
Suppose a process occupies n register windows at the time of a context switch. In general,
let

$$\text{Strategy } (k, j), \qquad 0 \le k \le n, \quad 1 \le j \le N$$

denote saving k windows and restoring j windows, where N is the current depth of procedure
calls. Note that N can be greater than n if windows have been saved on the window stack in
memory. Several strategies of this form are possible. In this paper, we consider three: Strategy
(n, n), Strategy (n, 1), and Strategy (0, 1).

Strategy (n, n) 

The obvious extension of context switching to a multiple window register file simply saves 
all active windows of the current process prior to context switching and restores those same 
register windows when the process receives its next time slice. Because the complete state of each 
process is restored prior to its time slice, the probability of register window underflow or overflow 
is independent of the multiprogramming mix. 

Empirical data suggest that most programs exhibit nesting depth locality. Specifically, the
dynamic depth of procedure calls changes only a small amount over long periods of time, even if
the maximal chain of dynamic calls is high.¹ Indeed, this is the primary reason a small set of
register windows on RISC-II can cache most sub-sequences of calls [Patt85]. However, because
nesting depth shows only a small variation with time, the register file is likely to contain many
register windows that will not be used by the process until far in the future. By analogy with
virtual memory, the register file contains more windows than those constituting the "working
set" of the process. As a result, Strategy (n, n) will often restore windows that will not be used
before the next context switch.

¹ This does not mean that there are few procedure calls, merely that the depth of calls changes little.

Strategy (n, 1)

Rather than restoring all windows, we might restore only the window corresponding to the 
currently active procedure. This reduces the cost of a context switch and, because each process 
resumes with more free windows, also reduces the probability of register file overflow. However, 
because only one window is restored, more register file underflows will occur than if the process 
ran by itself. Suppose that a process needs all windows saved during the previous context switch. 
Underflows will cause these windows to be restored singly, and the total cost will be greater than 
if they were restored en masse. Certainly, the size and number of windows will determine
whether this difference is important. Table 2 shows that the context switching cost for saving n
RISC-II windows takes the form an + b and that bulk restores are more efficient.

Unlike Strategy (n, n), the length of the time slice interacts with Strategy (n, 1) to change 
the overflow and underflow probabilities. Strategy (n, 1) has lower context switching cost, but 
potentially higher procedure underflow costs. Because the efficacy of the two strategies depends 
on the number of windows in the register file, the dynamic chain of procedure calls, and the 
number and size of register windows, it is difficult to predict a priori their relative performance.
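
The trade-off between bulk and single-window restores can be made concrete with the cycle costs reported in Table 2 (context restore of 27 + 45n cycles for Strategy (n, n), 67 cycles for Strategy (n, 1), and 57 cycles per underflow trap). The following is a minimal sketch; the function names are illustrative, and it assumes that each additional window the process returns into costs exactly one underflow trap.

```python
# Cycle costs taken from Table 2; the function names are illustrative.
UNDERFLOW_TRAP = 57


def restore_bulk(n):
    """Strategy (n, n): restore all n saved windows at the context switch."""
    return 27 + 45 * n


def restore_lazily(n, windows_needed):
    """Strategy (n, 1): restore one window, then pay one underflow trap for
    each additional saved window the process actually returns into."""
    assert 1 <= windows_needed <= n
    return 67 + UNDERFLOW_TRAP * (windows_needed - 1)


n = 8
print(restore_bulk(n))        # 387
print(restore_lazily(n, n))   # 466
print(restore_lazily(n, 2))   # 124
```

Under these costs, a process with eight saved windows is better served by Strategy (n, 1) as long as it returns into at most six of them before its next context switch; beyond that, the bulk restore is cheaper.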

Strategy (0, 1)

The two strategies proposed above save all active register windows. Clearly, context 
switching overhead is minimized if no windows are automatically saved: Instead, windows can be 
saved as needed. This approach is similar to that used with caches. That is, register windows 
remain in the register file until their space is needed. If no intervening process needs the space, a 
process may find that some of its register windows are still in the register file at the beginning of 
its next time slice. 

With this strategy, a window overflow or underflow trap procedure must be able to 
determine the owner of each register window. Therefore, a process identifier register and an 
occupancy flag must be associated with each window, and procedure calls must load these
registers appropriately. Finally, the order of a process’ windows in memory must be preserved. 
If the youngest window belonging to a process is saved before older windows, space in the 
memory stack must be reserved for those intervening windows. 

When a process resumes execution after a context switch, its register windows may appear 
in several possible states. 

(1) No windows belonging to the resumed process are in the register file and either 

(a) no free windows exist, or 

(b) at least one free window exists. 

(2) At least one window belonging to the resumed process is in the register file and either 

(a) the process’ most recently active window is in the register file, or 

(b) the process’ most recently active window is not in the register file. 

The necessary action differs in each case. 

In cases (1a) and (1b), the window belonging to the active procedure of the process must be
restored from the top of the corresponding memory stack. There are several possibilities for its 
placement in the register file. If the process that just relinquished the processor left free windows 
(i.e., the process could have executed another procedure call without window overflow) one of 
these can be allocated. However, even this poses alternatives. As Figure 2 shows, it is possible to 
restore the process window to the free window following the one pointed to by the CWP (current 
window pointer) of the previous process, window (a). This permits the maximum number of calls 
before a window must be written to memory. Alternatively, a free window just before the one 
pointed to by the SWP of the previous process can be allocated, window (b) in Figure 2, giving 
preference to returns. As a compromise, a free window between the two pointers would give 
equal preference to both calls and returns. These choices will determine how windows in the
register file are populated.

Figure 2 Strategy (0, 1) Window Management

    Call:
        CWP := (CWP + 1) mod N
        if CWP = SWP then overflow

    Return:
        CWP := (CWP - 1) mod N
        if CWP = SWP then underflow

If the process active during the previous time slice did not leave any free windows (i.e., 
another call would have caused an overflow), two possibilities exist. Either the entire register file 
is full, or there exists at least one free window in the register file that is not in the contiguous 
region between CWP and SWP used by the previous process. In the first case, one or more 
windows should be saved in memory. In the second case, a free window must be located. The 
“placement” of the resumed process window in the register file is analogous to the cache 
replacement strategies [Smit82]. Like those strategies, it must be fast and efficient. 

The preceding discussion concerned only cases (1a) and (1b), when no windows belonging to
the process remained in the register file at the beginning of its time slice. If the register file is 
large, or the multiprogramming level is low, some windows belonging to the process may remain 
in the register file. This is analogous to a "warm start" in a cache [Smit82]. However, window 
restoration can be completely avoided only if the topmost portion of the window stack belonging 
to the process still resides in the register file. Otherwise, either the portion of the window stack 
still in the register file must be augmented with those windows in memory; or the portion of the 
window stack in the register file must be saved in memory, and the topmost window of the stack
loaded from memory. These alternatives are necessary if the stack structure of the windows is to 
be maintained in the register file. 

Clearly, there are many possible implementations of Strategy (0, 1). The similarities with 
caches are obvious, although subtle differences exist, primarily because the order of windows in 
the register file reflects the call/return sequences of processes. Window replacement policies must 
maintain this order. Finally, additional hardware is needed for Strategy (0, 1) implementation; 
for details on the hardware requirements, see [Watc87]. 

Space precludes a complete analysis of Strategy (0, 1). Hence, in this paper, we assume that 
Strategy (0, 1) must maintain a contiguous group of register windows in both the register file and
associated memory stack of each process. Moreover, if any windows belonging to a process 
remain in the register file, we require that the most recently used window also remain in the 
register file. Saving this window in memory forces the saving of all other windows belonging to 
the process. Thus, the most recently used segment of windows belonging to a process remains in 
the register file. If a process regains control of the processor and finds that its current window is 
missing, the first group of free windows, beginning at window zero, is used to allocate a window 
for the process. This window is located at the middle of the free group, giving equal preference 
to procedure calls and returns. Finally, if no free windows exist, the least recently used window 
of the process relinquishing the processor is replaced. This maximizes the time until the replaced 
window is needed. 
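
The placement rule assumed here (the middle of the first group of free windows, scanning from window zero) can be sketched as follows. This is an illustrative sketch only: the `owner` list is a hypothetical stand-in for the per-window process identifier and occupancy flag, the scan ignores the circular wraparound of the register file, and the least-recently-used replacement used when no window is free is omitted.

```python
def place_window(owner):
    """Return the index of the window to allocate, or None if none is free.

    owner[i] is the id of the process occupying window i, or None if free.
    """
    start = None
    for i, pid in enumerate(owner):
        if pid is None and start is None:
            start = i                       # the first free group begins here
        elif pid is not None and start is not None:
            return (start + i - 1) // 2     # middle of the group [start, i - 1]
    if start is not None:
        return (start + len(owner) - 1) // 2
    return None
```

With windows 2 through 4 free, for instance, the resumed process receives window 3, leaving one free window for a call and one for a return.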

In the next section, we formalize the interdependency of context switching and window 
underflow/overflow as an optimization problem and show how it can be analyzed for simple 
programs. Following that, we compare the performance of the context switching strategies just 
described using trace driven simulations. 
3. Analytic Models of Register Management

As we have just seen, the window management strategy used for context switching can 
affect the register file overflow and underflow probabilities. Moreover, increasing (decreasing) the 
size of the register file will decrease (increase) overflow and underflow probabilities while 
increasing (decreasing) the context switching cost. Abstractly, however, the execution time of a 
program, measured in machine cycles,² depends on the number of program instructions executed,
the window management cost for procedure calls, and context switching overhead. Selecting a
window management strategy is then reduced to the following optimization problem:

$$\text{minimize } \mathit{ExecutionTime}(P, W, Ts) \quad \forall P \in \mathit{MPSet} \qquad (1)$$

$$\text{subject to } 1 \le W \le W_{max}, \quad 1 \le Ts \le Ts_{max}$$

where P is a program in the multiprogramming set MPSet, W is the number of windows in the
register file, and Ts is the time slice. The execution time, in turn, is given by

$$\mathit{ExecutionTime}(P, W, Ts) = \mathit{Instructions}(P) + \mathit{Context}(P, W, Ts) + \mathit{Overflow}(P, W, Ts) \qquad (2)$$

where Instructions(P) is the execution time of a program without procedure and context switch
overhead, Context(P, W, Ts) is the cost of context switching, and Overflow(P, W, Ts) is the
cost of window management during procedure call and return.

Because this optimization problem depends on the multiprogramming mix and the
interaction of programs with the context switching strategy, there is little prospect of solving the
general case. However, for many interesting cases, closed form solutions are possible. Although
these solutions might initially appear to be of marginal value, they provide insight into the
interaction of the parameters. The following section analyzes the performance of self-recursive
programs as a function of time slice; see [Watc87] for an extension to other program classes.

² Because all RISC-II instructions other than load or store execute in a single cycle [Kate84], modeling program
execution time is straightforward.

Self-Recursive Programs 

Consider a class of self-recursive programs where, at any depth, the
probability of an additional call is p. Then the distribution of procedure depths is geometric, and
the expected depth for any execution of the program is

$$D = \frac{1}{1 - p}. \qquad (3)$$


Let T be the execution time of each procedure, and, for simplicity's sake, let each procedure call
occur at the point T/2. That is, each procedure invocation executes for T/2 time units, recursively
invokes itself, and, following the return of the recursive call, executes for an additional T/2 time
units. Then the mean program execution time, exclusive of procedure call and context switching
overhead, is

$$\mathit{Instructions}(P) = TD. \qquad (4)$$


Now consider Strategy (n, n), which saves and restores the complete context of each process.
Because the program state is unchanged after each context switch, the procedure call overhead is
independent of the time slice. If the depth of procedure calls D does not exceed the number of
register windows W, there is no procedure call overhead. Otherwise, each call of depth greater
than W causes both a window overflow and underflow. Hence, the overflow cost is

$$\mathit{Overflow}(P, W, Ts) = \begin{cases} 0 & D \le W \\ S\,(D - W) & D > W, \end{cases} \qquad (5)$$

where S is the cost to save and restore a single register window.
Finally, the time slice Ts can be either smaller,

$$Ts = \frac{T}{2k}, \qquad k = 1, 2, 3, \ldots$$

or larger,

$$Ts = \frac{Tk}{2}, \qquad k = 1, 2, 3, \ldots$$

than T/2. In the first case, each procedure call suffers multiple context switches. Conversely,
there are several procedure calls per time slice in the second case. We consider the two cases
separately.


Case Ts = T/(2k):

As Figure 3a shows, the procedure at depth d suffers context switches with d windows in
the register file, both before and after executing its recursive call. Thus, the context
switching cost for Strategy (n, n) is

$$\mathit{Context}(P, W, Ts) = 2kS \sum_{d=1}^{D} d = kSD\,(D + 1) \qquad (6)$$

if D ≤ W. Here k is the number of context switches in each half of a procedure invocation
(before and after its recursive call), and S is the cost to save and restore one window.


Similarly, if the mean depth of calls D exceeds the number of windows W, Figure 3b shows
that the D - W procedure invocations that overflow the register file suffer context
switching cost kWS(D - W) before their recursive calls and cost kS(D - W) after their
recursive calls return. Why? On the downward chain of calls, the register file fills, and each
context switch must save the entire register file of W windows. On the upward chain of
returns, the register file empties, and each context switch saves only a single window. Thus,
the context switching cost is

$$\mathit{Context}(P, W, Ts) = 2kS \sum_{d=1}^{W} d + kWS\,(D - W) + kS\,(D - W) \qquad (7)$$

if D > W.
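
As a consistency check on the D > W case (a worked simplification, not part of the original derivation), the three cost terms named above, the sum over the first W depths plus kWS(D - W) and kS(D - W), collapse to a single product:

$$2kS \sum_{d=1}^{W} d + kWS\,(D - W) + kS\,(D - W) = kSW(W + 1) + kS\,(W + 1)(D - W) = kSD\,(W + 1).$$

The cost is therefore linear in W with slope kSD, and setting W = D gives kSD(D + 1), which agrees with the D ≤ W case at the boundary.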

Case Ts = Tk/2:

This case is similar to the previous one except that the context switching interval exceeds
T, the procedure execution time. Thus, successive context switches see the number of
allocated register windows grow by increments of k. The number of context switches on the
downward chain of calls is D/k if D ≤ W, and the context switching cost is

$$\mathit{Context}(P, W, Ts) = SD \left[ 1 + \frac{D}{k} \right]. \qquad (8)$$

Similarly, if D > W, the context switching cost is

$$\mathit{Context}(P, W, Ts) = S \left[ W \left( \frac{W}{k} + \frac{D - W}{k} + 1 \right) + \frac{D - W}{k} \right]; \qquad (9)$$

see [Watc87] for a complete derivation of these formulae.

Inspecting these equations shows that the overflow cost (5) is a linearly decreasing function
of the number of windows W in the register file. Similarly, the context switching cost equations
(7) and (9) are linearly increasing functions of W. If aW + b denotes the overflow cost, and
cW + d denotes the context switching cost,³ the linear combination (a + c)W + (b + d) can
have either positive or negative slope. If the slope is positive, the total overhead will increase
with the number of register windows. Conversely, with negative slope, overhead is minimized
with a large number of windows in the register file. The optimal choice depends on the time
slice, cost of register saves and restores, and the depth of procedure calls.
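
The sign of the slope can be explored numerically. The sketch below evaluates the overflow cost (5) and the Strategy (n, n) context switching cost for the small time slice case, Ts = T/(2k), as given by equations (6) and (7). The parameter values are illustrative assumptions, not measurements, and the function names are ours.

```python
def overflow_cost(D, W, S):
    """Equation (5): window overflow/underflow cost."""
    return 0 if D <= W else S * (D - W)


def context_cost(D, W, S, k):
    """Equations (6) and (7): Strategy (n, n) with time slice Ts = T / (2k)."""
    if D <= W:
        return k * S * D * (D + 1)
    filled = 2 * k * S * sum(range(1, W + 1))
    return filled + k * W * S * (D - W) + k * S * (D - W)


def total_overhead(D, W, S, k):
    return overflow_cost(D, W, S) + context_cost(D, W, S, k)


D, S, k = 20, 50, 1   # illustrative values, not measurements
costs = [total_overhead(D, W, S, k) for W in range(1, D)]
slopes = {b - a for a, b in zip(costs, costs[1:])}
print(slopes)   # {950}: constant slope S * (k * D - 1)
```

For W < D the total is linear in W with slope S(kD - 1), so with these assumed values each additional window adds overhead; the slope becomes negative only when kD < 1.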

Figure 4 illustrates one combination of values based on the derivation just presented. 
Similarly, Figure 5 shows the interaction of context switching cost and procedure overflow cost 
on an actual, linearly self-recursive program, factorial, when time sliced on RISC-II with varying 
sized register files and the register management costs given in Table 2. The critical dependence 
of Strategy (n, n) on so many parameters suggests that it is inappropriate for a 
multiprogramming environment. 

Analysis of program behavior is not restricted to linearly self-recursive programs nor to
just Strategy (n, n). The technique has been applied to programs whose call probability is a
function of depth and to programs with richer patterns of call behavior (e.g., trees). Moreover,
other window management strategies, including Strategy (n, 1) and variations of Strategy (0, 1),
are amenable to this technique [Watc87].

4. A Simulation Study of Register Management 

Although the analysis in the previous section does provide insight into the behavior of 
certain program classes, it cannot be used to precisely predict the performance of real program 
mixes. For this, trace driven simulation is needed. 

Selection of benchmarks for trace driven evaluation is always difficult. The desire for
generality must be balanced against the cost of simulating many program traces. Reducing the
number of benchmarks to reduce simulation costs means that the remaining benchmarks must
reflect "typical" behavior. Moreover, continuity with other studies [Patt82, Hitc85] is necessary
to maintain a standard of reference.

³ Table 2 shows just such cost functions for RISC-II.

The simulation experiments reported below were based on nine benchmarks. The first six 
are those used during the RISC-II evaluation and permit comparison with previously reported 
studies [Patt82, Tami83, Hitc85]. Because many of these programs have been criticized as 
procedure-call intensive, the set was augmented with three other programs: the Dhrystone 
synthetic benchmark [Weic84], the UC-Berkeley RISC-II simulator (Rsim) executing the
Fibonacci program, and the sed editor editing a 500-line UNIX manual. 

Table 1 shows the characteristics of these benchmarks when executed stand-alone on
RISC-II with 8 register windows. The call/return instruction frequencies and call/return
memory traffic shown in Table 1 include instructions for saving and restoring both local registers
and environment registers (e.g., program counter and stack pointer). The maximal procedure
nesting depth MND is the minimum number of register windows sufficient to avoid both register
window overflow and underflow. Similarly, ACS is the average number of sequential procedure
calls before a return.

Table 1 Benchmark characteristics

| Benchmark | Call/return instructions (%) | Call/return memory traffic (%) | MND¹ | ACS² | WRR³ | ARM⁴ |
|-----------|------|------|-----|------|----|---------|
| Ackerman  | 17.4 | 49.2 | 512 |      | 6  | 3.5±1.5 |
| Fibonacci | 21.9 | 42.9 | 21  | 2.00 | 4  | 2.3±0.8 |
| Hanoi     | 16.7 | 48.7 | 20  | 2.00 | 10 | 3.5±2.5 |
| Puzpnt    | 0.8  | 6.6  | 19  | 1.21 | 11 | 4.5±1.2 |
| Puzsub    | 0.7  | 3.3  | 19  | 1.10 | 10 | 2.3±1.0 |
| Qsort     | 9.7  | 27.3 | 10  | 1.01 | 15 | 2.6±2.3 |
| Dhrystone | 8.6  | 22.4 | 5   | 1.25 | 12 | 3.8±1.5 |
| Rsim      | 0.8  | 2.6  | 6   | 1.06 | 12 | 2.2±0.7 |
| Sed       | 1.4  | 5.9  | 7   |      | 11 | 4.7±1.4 |

¹ MND - maximal nesting depth
² ACS - average length of call sequences
³ WRR - number of registers in a window referenced in the benchmark
⁴ ARM - average number of registers modified in a procedure
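
The two trace statistics just defined, MND and ACS, can be computed from a call/return trace as sketched below. This is an illustrative sketch rather than the tool used in the study: the trace encoding (a string of 'C' and 'R' characters) is hypothetical, and counting the main procedure as depth 1 is an assumption.

```python
def nesting_stats(events):
    """events: string of 'C' (call) and 'R' (return). Returns (MND, ACS)."""
    depth = max_depth = 1      # assume the main procedure occupies depth 1
    runs = []                  # lengths of runs of consecutive calls
    run = 0
    for e in events:
        if e == "C":
            depth += 1
            max_depth = max(max_depth, depth)
            run += 1
        else:
            depth -= 1
            if run:
                runs.append(run)   # a return ends the current run of calls
            run = 0
    if run:
        runs.append(run)
    acs = sum(runs) / len(runs) if runs else 0.0
    return max_depth, acs
```

For the trace "CCRCRR", the nesting depth peaks at 3 and the call runs have lengths 2 and 1, so the function returns (3, 1.5).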

Comparison of the benchmark set characteristics with those for other workloads [Clar82, 
Wiec82, Emer84] shows that at least three program classes were included in the set: 

• procedure intensive with a greater than normal frequency of procedure calls (Ackerman,
Fibonacci, and Hanoi),

• procedure typical with average procedure call frequency (Dhrystone, Qsort), and

• procedure parsimonious with minimal procedure call frequency (Puzpnt, Puzsub, Rsim, Sed).

With the exception of the Ackerman benchmark, the dynamic pattern of procedure nesting depth
confirms the locality of procedure nesting. 

Simulated Multiprogramming 

Like cache performance studies [Smit82], the performance of RISC-II context switching 
strategies depends on the multiprogramming mix and the process scheduling algorithm. The 
experiments presented below were based on a simple, round-robin scheduling algorithm with a 
fixed time slice. As defined, Strategy (n, n) and Strategy (n, 1) are independent of the mix of
programs; only the length of the time slice is important. For these two strategies, it suffices to
simulate programs singly and calculate the context switching cost at fixed multiples of the time 
slice. 
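
For Strategy (n, n), this single-program calculation can be sketched as follows. The trace format (a list of `(cycle, event)` pairs) and the function name are hypothetical, and the save and restore costs are the Table 2 cost functions; this is a sketch of the idea, not the simulator used for the experiments.

```python
def context_switch_cost(trace, num_windows, time_slice):
    """trace: list of (cycle, event) pairs sorted by cycle, event 'C' or 'R'.

    Returns the total Strategy (n, n) save + restore cycles, sampling the
    number of resident windows n at every multiple of time_slice.
    """
    resident, in_memory = 1, 0
    next_switch = time_slice
    total = 0
    for cycle, event in trace:
        while cycle >= next_switch:
            n = resident
            total += (37 + 43 * n) + (27 + 45 * n)   # Table 2 save + restore
            next_switch += time_slice
        if event == "C":
            if resident == num_windows:
                in_memory += 1       # overflow: oldest window goes to memory
            else:
                resident += 1
        else:
            if resident > 1:
                resident -= 1
            elif in_memory > 0:
                in_memory -= 1       # underflow: one window restored
    return total
```

Context switches that would fall after the last traced event are not counted in this sketch.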

The performance of Strategy (0, 1) does depend on the mix of programs. Thus, it was
necessary to capture program instruction traces and simulate context switches among the traces 
[Kons86]. In all cases, a multiprogramming level of three was used. Among the three program 
classes discussed earlier, three mixes were created. 
Mix 1 is a combination of programs from each of the three classes, procedure intensive,
typical, and parsimonious. Fibonacci, Dhrystone and Puzpnt are used in this mix. 

Mix 2 is a group of homogeneous programs drawn from the same class. Three combinations 
are possible here. First, Fibonacci (two copies) and Hanoi constitute a mix of procedure intensive 
programs. Second, Dhrystone (two copies) and Qsort are a "typical" mix. Finally, Puzpnt (two 
copies) and Sed constitute a mix of procedure parsimonious programs. 

Mix 3 is similar to the first mix, except that Fibonacci was replaced by the Ackerman 
program. Ackerman’s absence of procedure nesting locality can degrade the performance of the 
entire multiprogramming mix [Watc87]. 

The choice of appropriate time slices for simulations is a difficult problem, because it 
depends on the hardware/software environment. For the VAX-11/780 with VMS, the average
time slice has been measured to vary between 1,812 and 9,729 instructions [Emer84, Clar85]. To
cover a range of possibilities, we repeated all experiments for the following time slices, measured
in cycles: 500, 1,000, 1,500, 5,000, 10,000, and 20,000.

Performance Measurement 

Procedure call and return overhead was calculated using the product of the number of 
window overflows and underflows and the execution time of the trap procedure servicing these 
events. Similarly, the context switch overhead was assumed to be the product of the number of 
context switches and the execution time of the context switching algorithm. Table 2 shows these
costs, obtained from an analysis of the RISC-II assembly code for each operation. 

Table 2 Procedure and context switch overhead (cycles)

| Strategy | Window overflow | Window underflow | Context save | Context restore |
|----------|----|-------------------|-------------|-------------|
| (n, n)   | 54 | 57                | 37 + n×43 ¹ | 27 + n×45 ¹ |
| (n, 1)   | 54 | 57                | 37 + n×43 ¹ | 67          |
| (0, 1)   | 54 | 57 or 94 + n×43 ² | 41          |             |

¹ n - number of active windows
² implementation strategy dependent

As stated before, the execution times of the benchmarks were used as a measure of the RISC
performance. The procedure and context switch overheads, Overflow(P, W, Ts) and
Context(P, W, Ts) in equation (2), were monitored separately to study behavior in two
execution environments: stand-alone mode and multiprogramming mode. The ratios of these
overheads to the program time (i.e., Context/Instructions, Overflow/Instructions, and (Overflow
+ Context)/Instructions) were used as performance metrics. Note that Instructions is the
optimal performance, the execution time without procedure and context switch overhead (i.e., for
an infinite number of windows and no context switching).
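
The three metrics can be computed from event counts as in the following minimal sketch. The function and parameter names are illustrative; the 54-cycle overflow and 57-cycle underflow trap costs are those reported in Table 2, and the per-switch save and restore costs are passed in because they depend on the strategy.

```python
def overhead_ratios(instr_cycles, overflows, underflows,
                    context_switches, save_cost, restore_cost):
    """Return (Context/Instructions, Overflow/Instructions, combined ratio)."""
    overflow = overflows * 54 + underflows * 57      # trap costs from Table 2
    context = context_switches * (save_cost + restore_cost)
    return (context / instr_cycles,
            overflow / instr_cycles,
            (overflow + context) / instr_cycles)
```

As a hypothetical example, a program of 1,000,000 instruction cycles with 100 overflows, 100 underflows, and 200 context switches under Strategy (n, 1) with n = 4 (save cost 37 + 4×43 = 209, restore cost 67) has a context switch overhead of about 5.5 percent and a procedure overhead of about 1.1 percent.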

4.1. Stand-alone Program Execution

Figures 6 and 7 show the procedure overhead for selected members of the benchmark set as 
a function of the number of windows in stand-alone mode. For most benchmarks, the procedure 
overhead becomes negligible long before the number of windows approaches the benchmark’s 
maximal nesting depth. For those benchmarks with parsimonious or typical procedure call 
frequencies, four windows suffice to reduce the procedure overhead to less than 2 percent of the 
program execution time. For highly recursive benchmarks such as Fibonacci and Hanoi, the 
procedure overhead is less than 6 percent when the number of windows exceeds 10. Only the
Ackerman benchmark shows anomalous behavior. With a maximum procedure nesting depth of 
512 and little locality in the pattern of procedure calls, the Ackerman benchmark benefits little 
from multiple register windows. This produces the very high procedure call overhead. 





















4.2. Execution in a Multiprogramming System 

Space precludes a complete presentation of the simulation results; see [Watc87] for details. 
Hence, we concentrate on two of the three program classes, procedure typical and procedure 
intensive using the Dhrystone and Fibonacci benchmarks as representatives; other benchmarks 
yield similar results. In all cases, we show the overhead for procedure calls and context switching 
as a function of the program execution time in stand-alone mode with an infinite number of 
register windows. 

Strategy (n, n) 

Figures 8 and 9 show that, for each context switching interval, there is an optimal number 
of windows. As the number of windows in the register file increases, the probability of window 
overflow decreases. Simultaneously, the cost of each context switch increases. These two trends, 
one increasing cost, the other decreasing cost, yield an optimal number of windows for a given 
context switching interval. This is in apparent contrast to the analytic results obtained earlier. 
Recall, however, that the Fibonacci benchmark is not linearly recursive. Instead, its pattern of 
calls (i.e., Fibonacci(n) = Fibonacci(n-1) + Fibonacci(n-2)) forms a tree of procedure call 
depths, as illustrated in Figure 10. This behavior yields a quadratic cost function for 
overhead; hence the minima in Figure 9. 
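
The tree-shaped call pattern can be made concrete with a small sketch that walks the Fibonacci recursion and counts window traps for a given register file size. The trap model here is an assumption made for illustration (each overflow spills exactly one window, each underflow reloads exactly one, and overlapping registers are ignored); it is not the RISC-II trap handler or the simulator used in this study.

```python
def fib_window_traps(n, windows):
    """Count window traps while evaluating recursive Fibonacci(n).

    Assumed model: `windows` frames fit in the register file, a call that
    finds the file full spills the oldest window, and a return whose
    caller is not resident reloads that one window.
    """
    if windows < 1:
        raise ValueError("need at least one window")
    stats = {"overflow": 0, "underflow": 0, "max_depth": 1}
    resident = 1   # frames of the current call chain held in the file
    depth = 1      # nesting depth; the outermost call is depth 1

    def fib(k):
        nonlocal resident, depth
        if k < 2:
            return k
        total = 0
        for arg in (k - 1, k - 2):
            # Procedure call: claim a window, spilling the oldest if full.
            depth += 1
            stats["max_depth"] = max(stats["max_depth"], depth)
            if resident == windows:
                stats["overflow"] += 1
            else:
                resident += 1
            total += fib(arg)
            # Procedure return: the caller's window must be resident.
            depth -= 1
            if resident == 1:
                stats["underflow"] += 1
            else:
                resident -= 1
        return total

    stats["value"] = fib(n)
    return stats


for w in (2, 4, 8, 10):
    s = fib_window_traps(10, w)
    print(w, s["overflow"], s["underflow"], s["max_depth"])
```

Under this model the trap count falls as windows are added and reaches zero once the file holds the maximum nesting depth, which is the decreasing half of the trade-off described above.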

Table 3 shows that the optimal number of windows for each benchmark is not a constant; it 
depends on the time slice. For small time slices, it is more important to minimize the number of 
register windows because these windows must be saved frequently. As the time slice increases, 
procedure overflow and underflow overheads dominate, favoring use of additional windows. 
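
The dependence of the optimum on the time slice can be illustrated with a toy cost model. Everything numeric below is a hypothetical assumption: the geometric decay of trap overhead with window count, the cost per window transfer, and the program length are chosen only to show the shape of the trade-off and do not reproduce the values in Table 3.

```python
def total_overhead(windows, time_slice, *, program_cycles=1_000_000,
                   base_trap_cycles=400_000, decay=0.6, cycles_per_window=40):
    """Toy overhead (in cycles) for Strategy (n, n) with `windows` windows."""
    # Assumed: trap overhead shrinks geometrically as windows are added.
    trap_overhead = base_trap_cycles * decay ** windows
    # Strategy (n, n): every switch saves and later restores all windows.
    switches = program_cycles / time_slice
    switch_overhead = switches * 2 * windows * cycles_per_window
    return trap_overhead + switch_overhead


def best_windows(time_slice, candidates=range(1, 33)):
    """Window count with the smallest toy overhead for this time slice."""
    return min(candidates, key=lambda w: total_overhead(w, time_slice))


for slice_cycles in (500, 1_000, 1_500, 5_000, 10_000, 20_000):
    print(slice_cycles, best_windows(slice_cycles))
```

Because the context switch term grows linearly in the window count and is scaled by the number of switches, shortening the time slice pushes the minimum toward fewer windows in this model.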

Table 3. Optimal number of windows for Strategy (n, n) 

                       Time slice (cycles) 
Benchmark      0.5K     1K   1.5K     5K    10K    20K 
Fibonacci         9      9     11     13     13     15 
Hanoi             7      ?      9      9     11     11 
[illegible]       2      2      3      3      4      5 
Puzsub            2      2      2      2      3      4 
Qsort             2      2      2      ?      4      5 
[illegible]       4      4      4      4      4      4 
Rsim              2      3      3      3      3      3 
Sed               3      3      3      ?      5      5 

Two final points about Figures 8 and 9 should be noted. First, the optimal number of 
register windows depends on the program type. For programs with modest procedure call depth, 
a small number of register windows is best. Using too many windows retains portions of the 
dynamic call chain that are not in the "working set" of windows, resulting in excessive context 
switching overhead. Likewise, using too few windows causes "window thrashing." The sensitivity 
of programs to the number of windows is striking, as the Dhrystone benchmark illustrates. In 
contrast, highly recursive programs like Fibonacci have a large window "working set" and need 
more windows. Second, the absence of a single register set size that minimizes execution time for 
all programs suggests that Strategy (n, n) is a poor candidate for register window management in 
a multiprogramming environment. 


Strategy (n, 1) 

As Figures 11 and 12 show, restoring a single window following a context switch greatly 
reduces the overhead, compared to Strategy (n, n). For all classes of benchmarks, the overhead 
approaches an asymptote as the number of register windows grows. Table 4 shows the number 
of register windows that yields execution time within 1 percent of the minimal execution time 
achievable with an infinite number of windows. A comparison with Table 3 shows that the values 
in Table 4 are slightly larger. Recall that Strategy (n, 1) restores only a single register window 
after a context switch. Thus, the mean number of windows a process can maintain in the register 
file is, for a fixed size register file, smaller for Strategy (n, 1) than for Strategy (n, n). This 
favors a slightly larger register file for Strategy (n, 1). 

Table 4. Optimal number of register windows under Strategy (n, 1) 

                     Time slice values (cycles) 
Benchmark      0.5K     1K   1.5K     5K    10K    20K 
Fibonacci        11     13     11     13     13     13 
Hanoi            11     13     11      9     11     11 
[illegible]       3      3      3      3      3      3 
Puzsub            2      2      2      2      2      2 
Qsort             2      3      3      3      3      3 
[illegible]       4      ?      4      ?      4      4 
Rsim              2      3      3      3      3      3 
Sed               4      4      4      4      4      5 

Because the performance of Strategy (n, 1) is monotonic in the size of the register file, it is a 
promising candidate for a multiprogramming environment. A register file large enough to 
accommodate highly recursive programs is also optimal for procedure parsimonious programs. 
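
The cost difference between the two strategies at a single context switch can be sketched as follows. The per-window transfer cost and the trap entry cost are hypothetical parameters, and the model assumes that windows left in memory by Strategy (n, 1) are reloaded one at a time by underflow traps only when the process returns into them.

```python
def switch_cost_nn(active, cycles_per_window=40):
    """Strategy (n, n): save every active window, then restore every one."""
    return 2 * active * cycles_per_window


def switch_cost_n1(active, returns_before_next_switch,
                   cycles_per_window=40, trap_cycles=20):
    """Strategy (n, 1): save every active window, restore only the newest.

    Older windows are reloaded one at a time by underflow traps, and only
    if the process actually returns into them before it is switched out.
    """
    eager = (active + 1) * cycles_per_window
    reloads = min(returns_before_next_switch, active - 1)
    return eager + reloads * (cycles_per_window + trap_cycles)


# Hypothetical process holding eight windows when it is switched out.
print(switch_cost_nn(8))        # cost if all eight windows are restored
print(switch_cost_n1(8, 2))     # cost if it returns through two frames
```

With these assumed costs, a process holding eight windows that returns through two frames during its next slice pays 480 cycles under Strategy (n, 1) against 640 cycles under Strategy (n, n). The saving shrinks as the process unwinds further within a slice.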

Strategy (0, 1) 

Figure 13 shows the procedure and context switch overhead under Strategy (0, 1) for both 
mix 1, a mixture of program types (Fibonacci, Dhrystone, and Puzpnt), and mix 2, a 
homogeneous program group (Dhrystone, Dhrystone, and Qsort). Comparing Figure 13 to 
Figure 11 shows that, within the range of 2 to 16 windows, Strategy (0, 1) is generally inferior to 
Strategy (n, 1). There are two principal reasons for this performance gap. 

First, recall that Strategy (0, 1) can potentially find the most recently used window of the 
process still in the register file, a "window hit." However, detailed examination of the simulations 
showed that, in mix 1, the hit ratio for one program was less than 10 percent and did not exceed 
60 percent for any program in the mix. Similarly, the hit ratio ranged from 10 to 80 percent for 
the programs in mix 2. When the most recently used window is not in the register file, the 
window underflow trap procedure must search for free windows in the register file. In most cases 
the register file was full, leading to large overheads. 

Second, because the register file utilization is so high, processes compete for free windows. 
In other words, the overall performance is degraded by interference among processes. This 
competition for windows can result in anomalies for certain processes in job mixes (i.e., a larger 
register file can actually increase the overhead); see Figure 13a. The effects of competition are 
most pronounced for small time slices, where each process spends a large portion of its time slice 
fetching register windows from memory. 
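
Both effects, low hit ratios and competition for windows, can be reproduced in a toy model of the register file as a shared cache. The sketch below is an assumption-laden illustration, not the simulator used here: processes run round-robin, each touches a fixed (hypothetical) number of windows per slice, windows are taken from the least recently run process when none are free, and a victim loses its oldest windows first, so a resuming process scores a hit whenever any of its windows is still resident.

```python
from collections import OrderedDict


def simulate_window_cache(file_size, working_sets, rounds=50):
    """Round-robin toy model of Strategy (0, 1).

    working_sets maps a process name to the number of windows it touches
    per time slice. Returns (hit_ratio, windows_saved_to_memory).
    """
    # Insertion order doubles as run order: least recently run comes first.
    resident = OrderedDict((name, 0) for name in working_sets)
    free = file_size
    hits = resumes = saved = 0

    for _ in range(rounds):
        for name, need in working_sets.items():
            resumes += 1
            if resident[name] > 0:       # newest window survived in the file
                hits += 1
            resident.move_to_end(name)   # now the most recently run process
            target = min(need, file_size)
            while resident[name] < target:
                if free > 0:
                    free -= 1
                else:
                    victim = next(p for p, count in resident.items()
                                  if p != name and count > 0)
                    resident[victim] -= 1
                    saved += 1           # victim window written to memory
                resident[name] += 1
    return hits / resumes, saved


# Program names follow mix 1; the per-slice window counts are hypothetical.
mix = {"Fibonacci": 10, "Dhrystone": 3, "Puzpnt": 3}
for size in (4, 8, 16, 32):
    ratio, saved = simulate_window_cache(size, mix)
    print(size, round(ratio, 2), saved)
```

In this model a register file smaller than the combined window demand is always full, so every resumption evicts another process's windows, while a file large enough to hold all working sets misses only on the first round.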

To overcome the window management overhead and the interference effect, Strategy (0, 1) 
requires a larger register file. This will increase the hit ratio and increase the window allocation 
for each process. Figure 14 shows the performance of Strategy (0, 1) for 16 to 80 windows on the 
Dhrystone benchmark. For a large enough register file, the hit ratio approaches 100 percent. 
Table 5 shows the overhead ratios for Strategy (0, 1) and Strategy (n, 1) with an infinite number 
of windows. For programs with typical procedure call patterns (e.g., Dhrystone), Strategy (n, 1) 
has roughly 5 times the overhead of Strategy (0, 1). For heavily recursive programs, the 
asymptotic overhead ratio approaches 20. This is tempered by the knowledge that the absolute 
overhead of both schemes is relatively small for a sufficient number of windows. In this light, 
Table 6 shows the number of windows necessary for Strategy (0, 1) to yield lower overhead than 
Strategy (n, 1). As can be seen, this depends heavily on the program mix. 

Table 5. Ratio of overheads, Strategy (n, 1) : Strategy (0, 1) 

                      Mix 1                            Mix 2 
Slice    Dhrystone   Puzpnt   Fibonacci    Dhrystone   Puzpnt   Fibonacci 
1.5K          6.67     6.92       20.21            ?        ?       20.21 
5K            4.41     7.38       17.96         6.17     7.76       17.96 
10K           4.15     7.94           ?         5.96     8.35           ? 
20K           3.17     8.57       13.37         4.86     9.04       13.37 

The effects of the program mix on the performance of Strategy (0, 1) and the variation in 
size of the register file necessary to optimize performance suggest that Strategy (n, 1) is likely 
preferable. However, Strategy (0,1) should be investigated further. 

5. Conclusions 

We have presented three window management strategies for a multiprogrammed RISC-II 
processor. The simplest strategy saves all active windows belonging to a process at the end of its 
time slice. Upon resumption, all windows are restored. Although this technique, Strategy (n, n), 
requires no modification to the existing RISC-II hardware, we showed via analytic models that 
the optimal size of the register file depends on the context switch interval and the pattern of 
procedure calls. Trace-driven simulation confirmed this dependence, which suggests that this 
strategy is inappropriate for a multiprogrammed environment. 


Table 6. Number of windows where Strategy (0, 1) is preferable to Strategy (n, 1) 

                      Mix 1                            Mix 2 
Slice    Dhrystone   Puzpnt   Fibonacci    Dhrystone   Puzpnt   Fibonacci 
1.5K            20       20          12           10       10          28 
5K              24       24          12           10       12          32 
10K             28       24          14           12       14          32 
20K             28       28          14           12       16          40 

The second approach, Strategy (n, 1), saves all active windows upon a context switch, but 
restores only one. Simulations showed that it is uniformly superior to the first strategy. 
Moreover, the context switching overhead decreased asymptotically with larger register files. As 
before, no modification to existing hardware is necessary. This suggests that a single, large 
register file can provide good performance in a multiprogrammed environment. 

The final technique, Strategy (0, 1), treats the register file as a cache, saving windows only 
when their space is needed. The performance of this strategy is sensitive to the mix of programs, 
unlike either of the other strategies. Although a larger register file is necessary to achieve good 
performance, this strategy is asymptotically superior to either of the two strategies that save the 
entire context each time slice. As we noted at the outset, there are many variations of Strategy 
(0, 1), based on the window replacement algorithms used. Further experimentation is needed to 
determine if the hardware costs of this approach are offset by increased performance. 

Figure 3(a). Context switching and procedure call overheads; W > D, k = 1. 
Figure 3(b). Context switching and procedure call overhead; W < D, k = 1. 
Figure 4. Procedure and Context Switch Overhead; D = 20, Ts = [illegible], k = 6 
          (axes: Time (x S), Number of Windows). 
Figure 5. Procedure and Context Switching Overhead; Factorial 100, Ts = 800 cycles 
          (axis: Time (x 100 cycles)). 
Figure 6. Procedure Overhead: Dhrystone, Sed (axis: % of Program Time). 
Figure 7. Procedure Overhead: Ackerman, Fibonacci, Hanoi. 
Figure 8. Procedure and Context Switch Overhead: Strategy (n, n), Dhrystone. 
Figure 9. Procedure and Context Switch Overhead: Strategy (n, n), Fibonacci. 
Figure 10. Procedure Nesting Depth of Fibonacci: Minimum, Maximum. 
Figure 11. Procedure and Context Switch Overhead: Strategy (n, 1), Dhrystone. 
Figure 12. Procedure and Context Switch Overhead: Strategy (n, 1), Fibonacci. 
Figure 13. Procedure and Context Switch Overhead: Strategy (0, 1) for Dhrystone 
           (axis: % of Program Time). 
           (a) Mix 1 (Fibonacci, Dhrystone, and Puzpnt). 
           (b) Mix 2 (Dhrystone, Dhrystone, and Qsort). 